{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Goal: Train a model using AutoML functionality! \n",
    "\n",
    "A popular approach to solve a machine learning problem is to try multiple approaches for training a model by running multiple algorithms on a dataset. Based on initial analysis, you can decide which algorithm to use for training and tuning the actual model. However, each algorithm can have specific feature requirements such as data must be numeric, missing values must be addressed before the training, etc. Performing algorithm specific feature engineering tasks can take time. Such a project can be shortened by running an AutoML algorithm that performs feature engineering tasks such as one-hot encoding, generalization, addressing missing values, automatically and then trains models using multiple algorithms in parallel.  \n",
    "\n",
    "This notebook demonstrates how to use such an AutoML algorithm offerd by [H2O.ai](https://aws.amazon.com/marketplace/seller-profile?id=55552124-d41b-4bad-90db-72d427682225) in AWS Marketplace for machine learning.  AutoML from H2O.ai trains one or more of following types of models in parallel:\n",
    "1. XGBoost GBM (Gradient Boosting Machine)\n",
    "2. GLM \n",
    "3. default Random Forest (DRF)\n",
    "4. Extremely Randomized Forest (XRT)\n",
    "5. Deep Neural Nets\n",
    "\n",
    "Once these models have been trained, it also creates two stacked ensemble models:\n",
    "1. An ensemble model created using all the models.\n",
    "2. Best of family ensemble model created using models that performed best in each class/family.\n",
    "\n",
    "For more information on how H2O.ai's AutoML works, see [FAQ section of H2O.ai's documentation.](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#faq)\n",
    "\n",
    "\n",
    "#### Compatibility\n",
    "This notebook is compatible only with [H2O-3 Automl Algorithm](https://aws.amazon.com/marketplace/pp/prodview-vbm2cls5zcnky) from AWS Marketplace and an AWS Marketplace subscription is required to successfully run this notebook. \n",
    "\n",
    "\n",
    "#### Usage instructions\n",
    "You can run this notebook one cell at a time (By using Shift+Enter for running a cell).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Let us install necessary H2O.ai library which you would use to load and inspect the model summary.\n",
    "import sys\n",
    "!{sys.executable} -m pip install http://h2o-release.s3.amazonaws.com/h2o/rel-wright/10/Python/h2o-3.20.0.10-py2.py3-none-any.whl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Import necessary libraries.\n",
    "import boto3\n",
    "import re\n",
    "import os\n",
    "import errno\n",
    "import base64\n",
    "import time\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import urllib\n",
    "from sagemaker import get_execution_role\n",
    "import json\n",
    "import uuid\n",
    "import sagemaker\n",
    "from time import gmtime, strftime\n",
    "import urllib.request\n",
    "from sagemaker import AlgorithmEstimator\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Subscribe to AutoML algorithm from AWS Marketplace\n",
    "\n",
    "1. Open [H2O-3 Automl Algorithm listing from AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-vbm2cls5zcnky?qid=1557245796960&sr=0-1&ref_=srh_res_product_title)\n",
    "2. Read the **Highlights** section and then **product overview** section of the listing.\n",
    "3. View **usage information** and then **additional resources**.\n",
    "4. Note the supported instance types and specify the same in the following cell.\n",
    "5. Next, click on **Continue to subscribe**.\n",
    "6. Review **End user license agreement**, **support terms**, as well as **pricing information**.\n",
    "7. Next, \"Accept Offer\" button needs to be clicked only if your organization agrees with EULA, pricing information as well as support terms. Once **Accept offer** button has been clicked, specify compatible training and inference types you wish to use. \n",
    "\n",
    "**Notes**: \n",
    "1. If **Continue to configuration** button is active, it means your account already has a subscription to this listing.\n",
    "2. Once you click on **Continue to configuration** button and then choose region, you will see that a product ARN will appear. This is the algorithm ARN that you need to specify in your training job. However, for this notebook, the algorithm ARN has been specified in **src/algorithm_arns.py** file and you do not need to specify the same explicitly."
   ]
  },
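  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **src/algorithm_arns.py** helper simply maps an AWS region to the corresponding algorithm ARN. If you prefer to manage the ARN yourself, a minimal sketch of such a helper is shown below; the class name and ARN template are hypothetical placeholders, so copy the actual ARN from the **Continue to configuration** page of the listing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch of an ARN provider; the ARN template below is a\n",
    "# placeholder. Replace <account-id> and <algorithm-name> with the values\n",
    "# shown on the listing's configuration page before using it.\n",
    "class MyAlgorithmArnProvider:\n",
    "    ARN_TEMPLATE = 'arn:aws:sagemaker:{region}:<account-id>:algorithm/<algorithm-name>'\n",
    "\n",
    "    @staticmethod\n",
    "    def get_algorithm_arn(region):\n",
    "        # Substitute the region into the ARN template.\n",
    "        return MyAlgorithmArnProvider.ARN_TEMPLATE.format(region=region)"
   ]
  },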
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "compatible_training_instance_type='ml.c5.4xlarge' \n",
    "\n",
    "compatible_inference_instance_type='ml.c5.2xlarge' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2 : Set up environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker as sage\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "role = get_execution_role()\n",
    "\n",
    "# Specify S3 prefixes\n",
    "common_prefix = \"automl-iris\"\n",
    "training_input_prefix = common_prefix + \"/training-input-data\"\n",
    "training_output_prefix = common_prefix + \"/training-output\"\n",
    "batch_inference_input_prefix = common_prefix + \"/batch-inference-input-data\"\n",
    "\n",
    "#Create session - The session remembers our connection parameters to Amazon SageMaker. We'll use it to perform all of our Amazon SageMaker operations.\n",
    "sagemaker_session = sage.Session()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Specify algorithm ARN for H2O.ai's AutoML algorithm from AWS Marketplace.  However, for this notebook, the algorithm ARN \n",
    "#has been specified in src/scikit_product_arns.py file and you do not need to specify the same explicitly.\n",
    "\n",
    "from src.algorithm_arns import AlgorithmArnProvider\n",
    "\n",
    "algorithm_arn = AlgorithmArnProvider.get_algorithm_arn(sagemaker_session.boto_region_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, configure the S3 bucket name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket=sagemaker_session.default_bucket()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, specify your name to tag resources you create as part of this experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "created_by='your_name'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Prepare and upload data \n",
    "\n",
    "Now that you have identified the algorithm you want to run, you need to prepare  data that is compatible with your algorithm. This notebook demonstrates AutoML using the Iris data set (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science). Note that we will be adding a missing value  to the first row to demonstrate that AutoML would take care of missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Background - The Iris dataset\n",
    "\n",
    "The [Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) contains\n",
    "150 rows of data, comprising 50 samples from each of three related Iris species:\n",
    "*Iris setosa*, *Iris virginica*, and *Iris versicolor*. \n",
    "\n",
    "![Petal geometry compared for three iris species: Iris setosa, Iris virginica, and Iris versicolor](https://www.tensorflow.org/images/iris_three_species.jpg) **From left to right,\n",
    "[*Iris setosa*](https://commons.wikimedia.org/w/index.php?curid=170298) (by\n",
    "[Radomil](https://commons.wikimedia.org/wiki/User:Radomil), CC BY-SA 3.0),\n",
    "[*Iris versicolor*](https://commons.wikimedia.org/w/index.php?curid=248095) (by\n",
    "[Dlanglois](https://commons.wikimedia.org/wiki/User:Dlanglois), CC BY-SA 3.0),\n",
    "and [*Iris virginica*](https://www.flickr.com/photos/33397993@N05/3352169862)\n",
    "(by [Frank Mayfield](https://www.flickr.com/photos/33397993@N05), CC BY-SA\n",
    "2.0).**\n",
    "\n",
    "Each row contains the following data for each flower sample:\n",
    "[sepal](https://en.wikipedia.org/wiki/Sepal) length, sepal width,\n",
    "[petal](https://en.wikipedia.org/wiki/Petal) length, petal width, and flower\n",
    "species. \n",
    "\n",
    "Sepal Length | Sepal Width | Petal Length | Petal Width | Species\n",
    ":----------- | :---------- | :----------- | :---------- | :-------\n",
    "5.1          | 3.5         | 1.4          | 0.2         | setosa\n",
    "4.9          | 3.0         | 1.4          | 0.2         | setosa\n",
    "4.7          | 3.2         | 1.3          | 0.2         | setosa\n",
    "&hellip;     | &hellip;    | &hellip;     | &hellip;    | &hellip;\n",
    "7.0          | 3.2         | 4.7          | 1.4         | versicolor\n",
    "6.4          | 3.2         | 4.5          | 1.5         | versicolor\n",
    "6.9          | 3.1         | 4.9          | 1.5         | versicolor\n",
    "&hellip;     | &hellip;    | &hellip;     | &hellip;    | &hellip;\n",
    "6.5          | 3.0         | 5.2          | 2.0         | virginica\n",
    "6.2          | 3.4         | 5.4          | 2.3         | virginica\n",
    "5.9          | 3.0         | 5.1          | 1.8         | virginica\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "training_data_location='data/training/iris.csv'\n",
    "urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',training_data_location)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us look at the sample training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head $training_data_location"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us add a header and a copy of first line to demonstrate that the AutoML listing takes care of missing values as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!sed -i '1s/^/sepal_length,sepal_width,petal_length,petal_width,species\\n,,1.4,0.2,Iris-setosa\\n/' $training_data_location"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head $training_data_location"
   ]
  },
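  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you prefer Python over sed, a roughly equivalent pandas sketch is shown below. Note that the sed command above has already modified the file, so run only one of the two approaches, on a freshly downloaded copy of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pandas-based alternative to the sed command above (a sketch; run it only\n",
    "# on a freshly downloaded copy of iris.data, not on the file that sed has\n",
    "# already modified).\n",
    "columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']\n",
    "df = pd.read_csv(training_data_location, header=None, names=columns)\n",
    "\n",
    "# Prepend a row with missing sepal measurements to exercise AutoML's\n",
    "# missing-value handling, then write the file back with a header row.\n",
    "missing_row = pd.DataFrame([[np.nan, np.nan, 1.4, 0.2, 'Iris-setosa']], columns=columns)\n",
    "pd.concat([missing_row, df], ignore_index=True).to_csv(training_data_location, index=False)"
   ]
  },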
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using the classic [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which the notebook downloads from the source. \n",
    "\n",
    "We can use use the tools provided by the Amazon SageMaker Python SDK to upload the data to an S3 bucket. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_input = sagemaker_session.upload_data(training_data_location, bucket, key_prefix=training_input_prefix)\n",
    "print (\"Training Data Location \" + training_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Train a model\n",
    "Next, let us train a model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "algo = AlgorithmEstimator(algorithm_arn=algorithm_arn, \n",
    "                          role=role, \n",
    "                          train_instance_count=1, \n",
    "                          train_instance_type=compatible_training_instance_type, \n",
    "                          sagemaker_session=sagemaker_session, \n",
    "                          output_path='s3://{}/{}/'.format(bucket,training_output_prefix), \n",
    "                          base_job_name='automl',\n",
    "                          hyperparameters={\"max_models\": \"30\",\n",
    "                          \"training\": \"{'classification': 'true', 'target': 'species'}\"},\n",
    "                          tags=[{\"Key\":\"created_by\",\"Value\":created_by}]) \n",
    "\n",
    "# Note: Apart from classification and target variables, you can also specify following additional parameter to\n",
    "# indicate categorical columns.\n",
    "#'categorical_columns': '<comma>,<separated>,<list>' \n",
    "\n",
    "algo.fit({'training': training_input}) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Review the leaderboard available in the log to understand how each of the top 10 models performed. By default, the metrics are based on 5-fold cross validation."
   ]
  },
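  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the training cell's output has been cleared, you can also pull the training log back from CloudWatch Logs with boto3 and search it for the leaderboard lines. The following is a sketch; it assumes the default `/aws/sagemaker/TrainingJobs` log group that SageMaker training jobs write to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve the training log from CloudWatch Logs to re-read the leaderboard\n",
    "# (a sketch; assumes the default SageMaker training log group).\n",
    "logs_client = boto3.client('logs')\n",
    "log_group = '/aws/sagemaker/TrainingJobs'\n",
    "job_name = algo.latest_training_job.name\n",
    "\n",
    "streams = logs_client.describe_log_streams(logGroupName=log_group,\n",
    "                                           logStreamNamePrefix=job_name)\n",
    "for stream in streams['logStreams']:\n",
    "    events = logs_client.get_log_events(logGroupName=log_group,\n",
    "                                        logStreamName=stream['logStreamName'])\n",
    "    for event in events['events']:\n",
    "        # The filter below is heuristic: leaderboard rows mention model ids.\n",
    "        if 'model_id' in event['message'] or 'StackedEnsemble' in event['message']:\n",
    "            print(event['message'])"
   ]
  },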
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 5. Analyze the trained model\n",
    "\n",
    "With H2O.ai's library, we can understand the model performance. Let us download the model, extract it, and then load the model in memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.client('s3')\n",
    "key= algo.model_data.replace(\"s3://\"+bucket+\"/\",\"\")\n",
    "s3.download_file(bucket,key,'model/model.tar.gz')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Let us extract the file and save the name in a variable\n",
    "file_name = !tar -xzvf model/model.tar.gz -C model/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#First retrieve the artifacts from S3 and then load them\n",
    "import h2o\n",
    "h2o.init()\n",
    "aml_model = h2o.load_model(\"model/\"+file_name[0])\n",
    "print(aml_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Review:\n",
    "1. **Model Details** section \n",
    "2. **Confusion matrix** \n",
    "3. **Cross-Validation Metrics Summary** using an editor.\n",
    "4. **Variable Importances section** available at the end of the summary. "
   ]
  },
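  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of scanning the full printed summary, you can pull individual pieces out programmatically, as sketched below. Variable importances are only reported by some model types (for example GBM or DRF leaders), so the `varimp` call may return `None` for ensemble models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pull individual parts of the model summary programmatically (a sketch).\n",
    "# Cross-validation metrics summary as a table:\n",
    "print(aml_model.cross_validation_metrics_summary())\n",
    "\n",
    "# Variable importances are only available for some model types (e.g. GBM/DRF),\n",
    "# so handle the case where the model does not report them.\n",
    "varimp = aml_model.varimp(use_pandas=True)\n",
    "print(varimp if varimp is not None else 'No variable importances for this model type.')"
   ]
  },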
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 6: Deploy the model and perform a real-time inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "from sagemaker.predictor import csv_serializer\n",
    "predictor = algo.deploy(1, compatible_inference_instance_type, serializer=csv_serializer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us view a sample from original training data and create a sample payload based on one of the entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!tail $training_data_location"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us pick a row, modify values slightly, and then perform an inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "payload=\"sepal_length,sepal_width,petal_length,petal_width\"+\"\\n\"+\"6.0,3.1,5.2,1.9\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that data has been prepared, let us perform a real-time inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(predictor.predict(payload).decode('utf-8'))"
   ]
  },
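  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The endpoint returns raw bytes. Assuming the response body is CSV, as the printed output above suggests, you can also load it into a pandas DataFrame for further processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the raw prediction response into a DataFrame (a sketch; assumes the\n",
    "# endpoint returns a CSV body, as printed above).\n",
    "import io\n",
    "\n",
    "response = predictor.predict(payload).decode('utf-8')\n",
    "prediction_df = pd.read_csv(io.StringIO(response))\n",
    "prediction_df"
   ]
  },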
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Congratulations!**, you have successfully performed a real-time inference on the model you trained using H2O.ai's AutoML algorithm! Check whether it predicted the correct class. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have finished performing predictions, you can delete the endpoint to avoid getting charged for the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "algo.delete_endpoint()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 7. Perform a batch inference \n",
    "Batch transform uses a trained model to get inferences on a dataset that is stored in Amazon S3, and saves the inferences in an S3 bucket that you specify. For more information, see [Get Inferences for an Entire Dataset with Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html)\n",
    "\n",
    "Next, let us prepare data and then write the payload to a file for running a batch transform job.\n",
    "\n",
    "For this notebook, we will specify values that are close to first and last row from the training data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Create a batch payload\n",
    "\n",
    "batch_payload=\"sepal_length,sepal_width,petal_length,petal_width\"+\"\\n\"+\"6.2,3.1,5.1,1.9\"+\"\\n\"+\"5.0,3.4,1.3,0.1\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Write payload to a file and then use the same for creating a batch transform job.\n",
    "\n",
    "fileHandle = open('data/batch-transform/input.csv', \"w\")\n",
    "fileHandle.write(batch_payload)\n",
    "fileHandle.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, upload the file written to an S3 bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transform_input = sagemaker_session.upload_data('data/batch-transform/input.csv', key_prefix=batch_inference_input_prefix)\n",
    "print(\"Transform input uploaded to \" + transform_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, invoke the batch-transform job using the payload file you uploaded to S3 bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformer = algo.transformer(1, compatible_inference_instance_type,  output_path=\"s3://\"+bucket+\"/\"+batch_inference_input_prefix)\n",
    "transformer.transform(transform_input, content_type='text/csv')\n",
    "transformer.wait()\n",
    "print(\"Batch Transform output saved to \" + transformer.output_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "View the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "key_path= transformer.output_path.replace(\"s3://\"+bucket+\"/\",\"\")\n",
    "key_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.resource('s3')\n",
    "\n",
    "obj = s3.Object(bucket, key_path+\"/input.csv.out\")\n",
    "\n",
    "data= obj.get()['Body'].read().decode('utf-8') \n",
    "print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have reviewed the prediction, you can delete the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Finally, delete the model you created.\n",
    "predictor.delete_model()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, if the AWS Marketplace subscription was created just for the experiment and you would like to unsubscribe to the product, here are the steps that can be followed.\n",
    "Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model-package or using the algorithm. Note - You can find this by looking at container associated with the model. \n",
    "\n",
    "Steps to un-subscribe to product from AWS Marketplace:\n",
    "1. Navigate to __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=lbr_tab_ml)\n",
    "2. Locate the listing that you would need to cancel subscription for, and then __Cancel Subscription__ can be clicked to cancel the subscription.\n",
    "\n"
   ]
  },
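  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following sketch uses standard SageMaker API calls to list the models in your account and print each model's container, so that you can verify none of them still reference the marketplace algorithm before cancelling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List SageMaker models and show their containers (a sketch) so you can check\n",
    "# whether any of them still reference the marketplace algorithm.\n",
    "sm_client = boto3.client('sagemaker')\n",
    "for model in sm_client.list_models()['Models']:\n",
    "    details = sm_client.describe_model(ModelName=model['ModelName'])\n",
    "    # A model may define either a single PrimaryContainer or a Containers list.\n",
    "    print(model['ModelName'], details.get('PrimaryContainer', details.get('Containers')))"
   ]
  },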
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook demonstrated how to perform AutoML with Amazon Sagemaker using H2O.ai's AutoML listing from AWS Marketplace. \n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
